Cross-Lingual Semantic Similarity Measure for Comparable Articles

نویسندگان

  • Motaz Saad
  • David Langlois
  • Kamel Smaïli
چکیده

We aim in this research to find and compare cross-lingual articles concerning a specific topic. So, we need a measure to compare articles. This measure can be based on bilingual dictionary or based on numerical methods such as Latent Semantic Indexing (LSI). In this paper, we use LSI in two ways to retrieve Arabic-English comparable articles. The first one is monolingual: the English article is translated into Arabic and then mapped into the LSI Arabic space; the second one is cross-lingual: Arabic and English documents are mapped into LSI Arabic-English space. Then, we compare LSI approaches to the dictionary-based approach on several English-Arabic parallel and comparable corpora. Results indicate that the performance of cross-lingual LSI approach is competitive to monolingual approach, or even better for some corpora. Moreover both LSI approaches outperform dictionary approach.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Probabilistic Models of Cross-Lingual Semantic Similarity in Context Based on Latent Cross-Lingual Concepts Induced from Comparable Data

We propose the first probabilistic approach to modeling cross-lingual semantic similarity (CLSS) in context which requires only comparable data. The approach relies on an idea of projecting words and sets of words into a shared latent semantic space spanned by language-pair independent latent semantic concepts (e.g., crosslingual topics obtained by a multilingual topic model). These latent cros...

متن کامل

An English-Chinese Cross-lingual Word Semantic Similarity Measure Exploring Attributes and Relations

Word semantic similarity measuring is a fundamental issue to many NLP applications and the globalization has made an urgent request for cross-lingual word similarity measure. This paper proposed a word semantic similarity measure which is able to work in cross-lingual scenarios. Basically, a concept can be defined by a set of attributes. The basic idea of this work is to compute the similarity ...

متن کامل

English-Persian Plagiarism Detection based on a Semantic Approach

Plagiarism which is defined as “the wrongful appropriation of other writers’ or authors’ works and ideas without citing or informing them” poses a major challenge to knowledge spread publication. Plagiarism has been placed in four categories of direct, paraphrasing (rewriting), translation, and combinatory. This paper addresses translational plagiarism which is sometimes referred to as cross-li...

متن کامل

NTHU at NTCIR-10 CrossLink-2: An Approach toward Semantic Features

This paper describes the approaches of NTHU in the NTCIR-10 Cross-Lingual Link Discovery task, also named CrossLink-2. In this task, we aim to discover valuable anchors in Chinese, Japanese or Korean (CJK) articles and to link these anchors to related English Wikipedia pages. To achieve the objective, we do not only depend on Wikipedia’s distinguishing features (e.g. anchor links information an...

متن کامل

A Comparison of Approaches for Measuring Cross-Lingual Similarity of Wikipedia Articles

Wikipedia has been used as a source of comparable texts for a range of tasks, such as Statistical Machine Translation and CrossLanguage Information Retrieval. Articles written in different languages on the same topic are often connected through inter-language-links. However, the extent to which these articles are similar is highly variable and this may impact on the use of Wikipedia as a compar...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014